
Short version introduction (PPT)

        The Chaos of Legacy Character Models
        ASCII
        ASCII,
        or the American Standard Code for Information Interchange,
        was developed by the ASA (American Standards Association),
        today known as ANSI,
        the American National Standards Institute.
        There had been other character codes prior to ASCII,
        like EBCDIC used on IBM mainframes and
        codes used for teletypes dating back to Morse code,
        but ASCII was adopted on the PC and spread with it.
        Today,
        ASCII is the base for all major character sets and
        all of the character sets discussed in this paper retain some form of compatibility with ASCII.
         
        
        At the time ASCII was developed,
        computational resources,
        especially memory,
        were very expensive and,
        for its intended use in telecommunication,
        ASCII was restricted to the bare minimum.
        The first version of ASCII, defined in 1963 [ASCII63],
        only had capital letters.
        Of course this shortcoming was quickly noticed and
        ASCII was extended in 1967 by the ECMA,
        the European Computer Manufacturers Association,
        as [ECMA6] and later adopted as [ASCII67],
        containing 94 visible characters,
        the space character and
        33 control characters (like delete,
        escape and
        line feed),
        encoded in 7 bits,
        as shown in Fig. 1.
        
        Because the only languages that could be fully written
        using the characters found in ASCII were
        Latin,
        Swahili,
        Hawaiian and
        American English,
        [ECMA6] defines the ASCII-compatible character set as
        the International Reference Version (IRV)
        and allows 12 of the lesser-used code points of ASCII,
        # $ @ [ \ ] ^ ` { | } ~
        to be substituted by "national variants".
        For example,
        the German variant, DIN 66003,
        substitutes these with the umlaut characters (ä ö ü Ä Ö Ü),
        the eszett character (ß) and the section sign (§),
        keeping the # and $ characters (Fig. 2).
        The Japanese variant,
        JIS X 0201,
        on the other hand,
        for which
        the possible substitutions were insufficient for Japanese anyway,
        replaced only the BACKSLASH character with the YEN SIGN and
        the TILDE with an OVERLINE.
        
        In 1972,
        the International Organization for Standardization (ISO)
        collected the major national variants and
        published them as [ISO646].
        The IRV was adopted as ISO-646-IRV and
        the national variants were called ISO-646-xx,
        like ISO-646-DE for German and
        ISO-646-JP for Japanese.
        
        By using the national variants,
        the basic demands of many countries and
        languages using the Latin alphabet
        could be fulfilled, but the situation was far from satisfactory.
        Using the variants,
        usually only one language could be properly represented at a time and
        there was no standard way to indicate
        which variant a specific text was using.
        
        European Languages: ISO-8859
        
        In order to handle even just West European languages like German,
        French,
        Italian and British English with all their accented characters
        at the same time,
        the 128 code points of 7-bit
        ASCII were obviously not enough.
        Fortunately,
        computers handle data in units of 8 bits, called bytes.
        ASCII being a 7-bit code,
        the 8th bit was sometimes used for controlling purposes but was usually set to 0,
        because it was very inconvenient,
        from a programmer's view,
        to squeeze eight 7-bit ASCII characters into 7 bytes.
        So, if this 8th bit was utilized,
        the number of characters that a single byte could represent would double from 128 to 256.
        
        So the ECMA started to design character sets that would be 8-bit and
        able to represent multiple European languages.
        A basic design principle was to be compatible with ASCII,
        so it was decided that
        code points 0x00 - 0x7f
        would be bit-by-bit compatible with ASCII and
        code points 0x80 - 0xff would contain additional characters.
        
        [ECMA94] defined 4 character sets,
        one each for West,
                     Central (East),
                     South and
                     North Europe,
            called Latin alphabet No. 1, 2, 3 and 4.
        Again,
        as with [ECMA6] and [ISO646],
        ISO adopted the various Latin alphabets as [ISO8859].
        Later,
            more alphabets with the same design philosophy were added.
        As the West European Latin alphabet,
            called Latin-1 or
                  ISO-8859-1, found wide adoption in West European countries,
            it became apparent that
                some characters were missing from it and
                others were less needed.
        So Latin alphabet No. 9 (ISO-8859-15)
            was defined as the successor of Latin-1,
                replacing some characters and
                adding others,
                especially the € sign,
                which didn't exist at the time [ECMA94] was created.


        But several of the character sets can't be intermixed
        without special handling by
        applications.
        Also,
        many systems,
            especially on the Internet,
            can't handle 8-bit characters cleanly,
            as they had been designed only with 7-bit ASCII in mind.
        If an 8-bit text passes through such a system,
        usually the 8th bit is zeroed and
        all characters originally having a code value higher than
            0x7f will be mangled.
        To prevent such damage,
        a character encoding scheme must be applied to the 8-bit text.
        For ISO-8859,
            it is common to encode characters
            from the 0x80 - 0xff range using
            the "quoted-printable" encoding [RFC1521].
                In this scheme,
                    the character to be encoded is represented by
                    3 characters from the ASCII character set:
                the = (equals) sign to signal the beginning of an encoded character and
                the character's hexadecimal code value.
        For this to work,
            the = sign used as encoding delimiter needs to be encoded, too.
            (Fig. 5)
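
        As a quick illustration,
        here is a minimal sketch of quoted-printable round-tripping
        using Python's quopri module
        (the sample Latin-1 text is an arbitrary choice for illustration):

            import quopri

            # Latin-1 text containing characters above 0x7f
            data = "Grüße aus Köln".encode("iso-8859-1")

            encoded = quopri.encodestring(data)
            print(encoded)  # -> b'Gr=FC=DFe aus K=F6ln', pure 7-bit ASCII
            print(quopri.decodestring(encoded).decode("iso-8859-1"))  # round-trips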

   CJK
        CJK stands for Chinese, Japanese and Korean.
        Why these 3 East Asian languages?
        They all use ideographic characters.
        Ideographic characters differ from phonetic characters,
        like the Latin alphabet,
        in that they not only have a pronunciation but also have a meaning themselves
        (Fig. 6 shows some examples).
        The ideographs used in all 3 countries date back to
            the Chinese Han dynasty
            but have since evolved separately.
        In Japanese they are called Kanji,
        in Chinese Hanzi and
        in Korean Hanja.
        Note the similarity of the names,
            though the spoken languages are totally different today.
        In addition to the ideographs,
            Japanese and
            Korean also have their own phonetic scripts.
        Because the ideographs are like words themselves,
            there are far more of them than
            the 256 that would fit into an 8-bit character set.
        Actually it is said that you need to know roughly 2,000 ideographs to be able to read a Japanese newspaper.
        In order to properly write people's names and names of places,
            computer systems need to handle more than 20,000 ideographs at the minimum.
        In Chinese,
            the number is even higher.
            [UNIHAN] is a large database of these ideographs
            together with sample glyphs and their meanings.


        In order to handle this large number of characters,
        multiple bytes were needed to represent a single character.
        In 1978,
            JIS X 0208 (originally JIS C 6226) was the first standardized character set that used 16 bits (= 2 bytes)
            for each character.
        It was designed for Japanese,
        but also contained basic Latin, Greek and
        Cyrillic alphabets as well as
        many symbols besides the Japanese Hiragana and
            Katakana alphabets and
        the most important Kanji.
        The idea was to make the character set convenient for every-day use by
        Japanese businesses.

        Though JIS X 0208 could theoretically contain 2^16 = 65,536 characters,
        in order to
        maintain compatibility with ASCII,
        JIS X 0208 was organized in 94 rows of 94 cells each,
        so that
        they could be mapped over the 94 graphical characters of ASCII.
        So actually only 94 x 94 = 8,836 characters can be included.
        Later,
            additional characters were added as extensions to JIS X 0208.
        Fig. 7 shows a single row of JIS X 0208.



        For Chinese,
        mainland China defined GB 2312 for
        Simplified Chinese characters in the same 94x94 layout as JIS X 0208.
        Taiwan, on the other hand,
            defined its own character set CNS 11643 Plane 1
            for Traditional Chinese characters.
        To make matters worse,
            another character set for Traditional Chinese,
            Big5, an industry standard,
            is also widely used
        (see [WITTERN] for an overview of Chinese character codes).


        Now,
        for a byte stream containing text in a multi-byte character code,
            the text needs to be specially encoded to be recognized as such.
        Otherwise it would not be possible to
            decide where character boundaries are or
            whether a specific byte is
                part of a multi-byte character
                or a single ASCII character.
        Very often multiple encodings exist
            for a single character set.
        Even for JIS X 0208 alone,
            there are 3 possible encodings that are widely used:
                raw JIS,
                EUC-JP and
                Shift-JIS.
            In the raw JIS encoding,
                the 1st byte has the raw row value and
                the 2nd byte the raw cell value of the character;
                    distinction from a stream of ASCII characters is impossible.

        EUC-JP
        (Extended Unix Code)
            simply sets the 8th bit that is not used by ASCII
            to distinguish multi-byte characters.

            As the name implies,
            EUC-JP is widely used in UNIX environments.

        Another encoding scheme is Shift-JIS,
        also known as MS-Kanji,
        because it was developed by
        Microsoft for use in its operating systems.

        In order to add an extra 64 characters,
            the characters from JIS X 0208 are
            "shifted" 64 code points and
            are reorganized as
                47 rows with 188 cells each.
        [PING] gives a simple overview of these encodings.
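
        To make the differences concrete,
        here is a small sketch using Python's codecs for these schemes
        (the character あ sits at row 4, cell 2 of JIS X 0208,
        i.e. raw JIS bytes 0x24 0x22):

            text = "あ"  # HIRAGANA LETTER A

            # raw JIS values appear after the escape sequence in ISO-2022-JP
            print(text.encode("iso-2022-jp"))  # b'\x1b$B$"\x1b(B'
            # EUC-JP: the raw bytes with the 8th bit set
            print(text.encode("euc-jp"))       # b'\xa4\xa2'
            # Shift-JIS: rows and cells rearranged by the "shift"
            print(text.encode("shift-jis"))    # b'\x82\xa0'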

        Korea also has 2 character sets in wide use:
            the national standard KS C 5601, which
                again has been modeled after JIS X 0208, and
                UHC (Unified Hangul Code).
        Though UHC was designed as a superset of KS C 5601,
        they have multiple totally different encoding schemes.



        Confused?
        Then imagine the nightmare of deciding which
        encoding a byte-stream of text is in and
        converting between  different encodings.

        Or what happens when you don't even know which language,
        and thus which potential encoding,
        a text is supposed to be in?
        Though there exist methods to identify character sets and
        encodings ([RFC2278])
        as well as ways to specify these,
        today's information,
        especially on the web, is poorly labeled,
        and a mixture of wild guessing and heuristics is used to
        determine a document's language and encoding,
        not even elaborating on the difficulty of creating multi-lingual documents.



        With the wide-spread use of 8- and
        16-bit character sets,
        most languages could be represented.
        But still,
            handling of multiple languages was limited to
            those included in a single character set.
        In order to use 2 or more languages
            that didn't share a common character set simultaneously,
        a new and (most likely) incompatible character set
        had to be created.

        Instead of creating dozens of new character sets,
        ISO-2022 ([ISO2022])
            defines a mechanism to use multiple character sets simultaneously by
            switching between them using escape sequences.
        Again, there is an identical ECMA standard, [ECMA35].


        ISO-2022 divides the 256 code points of a single byte into 4 areas:
            CL (Control Left, primary control characters),
            GL (Graphical Left, graphical characters),
            CR (Control Right, secondary control characters)
        and GR (Graphical Right, graphical characters),
        mapped over ASCII's layout of control and graphical characters
        (Fig. 8).

        At the beginning of an ISO-2022 encoded text,
        up to 4 character sets are designated as G0, G1, G2 and G3,
        using escape sequences,
        each starting with the ESCAPE control character followed by
            one or more bytes.
        Then 2 of the character sets are assigned to either the GL or GR area as needed.
        Depending on whether a byte has its 8th bit set or not,
            it is clear to which character set it belongs.
        A text can also simply be encoded into 7 bits for transfer over the Internet by
            utilizing only the GL area and
            switching the character set assigned to GL.
        Fig. 9 shows an example
            text containing US-ASCII and
            JIS X 0208 characters encoded in ISO-2022.
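
        A small sketch of this 7-bit scheme,
        using Python's ISO-2022-JP codec
        (the sample string is arbitrary):

            mixed = "ABC あいう ABC"
            print(mixed.encode("iso-2022-jp"))
            # b'ABC \x1b$B$"$$$&\x1b(B ABC'
            #       ^^^^^ ESC $ B designates JIS X 0208 into G0/GL
            #                      ^^^^^ ESC ( B switches G0 back to US-ASCII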


        Additionally,
        the current character set can be explicitly switched,
        either only for the next character or
        until switched again.
        Should more than 4 character sets be needed,
        the Gx areas can be reassigned.
        The character sets are identified in the escape sequences by
        codes assigned to them in [ISOESC].

        ISO-2022 was supposed to become the one encoding to rule them all,
        the many character sets,
        but isn't being used as widely as had been expected at the time of its design.
        ISO-2022's complexity is the main reason for its failure.
        Being a stateful encoding,
            switching from one character set to another,
            processing of ISO-2022 encoded sequences of bytes is non-trivial.
        In order to know which character set
            is used for a given byte,
        the whole sequence must be analyzed.
        To make matters worse,
            if a byte or
            even a single bit of an ISO-2022 encoded text
                got lost or corrupted,
        the whole meaning of the text could change,
        especially if an escape sequence was damaged.
        Additionally,
        applications must have knowledge of all character sets they expect to encounter in an ISO-2022 encoded text.

Unicode
        Also,
        with the global spread of personal computers,
        software manufacturers became more and
        more stressed.
        Software was usually developed in English,
        with only English and
        8-bit character sets in mind.
        After the release of the English version,
        it often took several months to a year for a company to provide localized versions of their software,
        because
        they not only had to translate all the text of the application,
        but also change the operating semantics of the software according to local needs.
        And parallel localization work in multiple countries cost real money,
        very often even more than the initial development cost of the software.
        And that for every new version of the software.
        As a result,
            customers in non-English countries felt discriminated against because
            their software was more expensive,
            out-of-date and
            usually badly localized because
            of limits in the original software.


        In the late 1980s,
        the idea of a Universal Character Set started to emerge at the research centers of various computer firms.
        In 1987,
        the term "Unicode" was first used in the course of discussions.
        The main difficulty in creating such a character set was
        that the requirements were not clearly known.
        After thorough research on the world's characters,
        the "Unicode 1.0" standard was published by
        the Unicode Consortium with 4 main design principles,
        learning from legacy character sets.

        Universal
            enough characters must be included so that all characters used in day-to-day operations world-wide can be represented

        Efficient
            the character code must be simple enough that computer systems can easily implement it

        Uniform
            the character code shall be uniform,
            so that sorting, searching, displaying and editing of text can be done efficiently without special exception rules

        Unambiguous
            any given code value always represents the same character


        At roughly the same time,
            ISO started developing an international standard
                with the same goals, which would later become ISO/IEC 10646.

            After the release of Unicode 1.0 in 1991,
                both efforts realized that
                    having 2 different,
                    incompatible but universal character sets
                        would be senseless
                        and merged their work,
            so that
                both standards share the same repertoire of characters using identical code numbers.
            Today the standards are so strongly linked that
            in various technical documentation,
            very often one or the other is used as reference.


        After the merge with ISO/IEC 10646,
        and the release of Unicode 1.0.1 and
        ISO/IEC-10646-1:1993,
        both standards continued to evolve together
            towards their goal of a Universal Character Set,
        adding more characters and
        clarifying various issues such as algorithms and  encodings as needed.
        Today,
            the newest revisions are ISO/IEC-10646-2:2000 and Unicode 3.2,
        the latter including more than 90,000 characters.
        Fig.  10 summarizes the history of Unicode and
        Fig. 11 shows the number of characters included in major releases of Unicode.



        When the Unicode Standard started to take form,
            the formation of the Unicode Consortium was announced,
            and shortly thereafter incorporated as
            the non-profit organization Unicode,  Inc.
        in 1991.
            The consortium's mission is to define Unicode characters and
            their relationship to each other, and to
                provide technical information and
                guidelines to implementers of the Unicode Standard.

        The consortium funds itself through
            sales of the standard in printed form and
            fees from its members,
        which include prominent computer hardware and
        software manufacturers like IBM,
            Hewlett-Packard,
            Oracle,
            SAP, Adobe,
            Apple and
            Microsoft,
            just to name a few.
        Individuals can join either as Specialist or Individual members,
            neither having voting rights
            but the former having full access to all members-only documents.

        Additionally,
        the consortium has liaison relationships with other national and
        international standardization bodies
            like the Internet Engineering Task Force (IETF),
            the World Wide Web Consortium (W3C),
            the High Council of Informatics of Iran and
            several joint working groups of
                the International Organization for Standardization (ISO)
                and International Electrotechnical Commission (IEC)
            working on internationalization.
        Especially the relationship with ISO/IEC JTC1/SC2/WG2 is important as they work closely with Unicode on the ISO/IEC 10646 standard.



        In order to realize the ambitious idea of an easily implementable Universal Character Set,
        the creators of Unicode made a simple but well-thought-out set of design decisions,
        aiming for compatibility with legacy character sets and ease of migration.
        For this purpose,
            some decisions were made against the ultimate goals,
            but paving the road for fast adoption of Unicode was given priority.


        16-bit


        Each character in Unicode's character repertoire
        is assigned a unique number,
        making Unicode a character set.
        To have enough space for all the characters in the world,
        each character's code is 16 bits long,
        allowing for a total of 65,536 code values.
        A character's value is noted in its hexadecimal form with the 'U+' prefix.
        For example,
            the code of LATIN CAPITAL LETTER A is U+0041.
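
        A tiny sketch of this notation in Python
        (the sample characters are arbitrary):

            for ch in "Aä€":
                print(f"{ch!r} -> U+{ord(ch):04X}")
            # 'A' -> U+0041
            # 'ä' -> U+00E4
            # '€' -> U+20AC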

        Deciding on a multiple of 8 was logical in the context of byte-oriented computer systems, and
        a 24-bit code seemed unnecessary and
        a hindrance to Unicode adoption,
        requiring 3 times as much space for storage as legacy codes from an American/European view.
        So 16 bits were chosen,
        despite heavy protests from East Asian countries,
        which knew from the experience with their own local 16-bit character sets
        that 16 bits would be insufficient even for their own needs.

        Unsurprisingly,
        after the first implementations of Unicode were introduced into the market and
        the products could not be decently used in East Asian markets,
        it finally became apparent to Unicode's designers
            that the 16-bit code space was indeed insufficient.
        Beginning with Unicode 2.0,
        an extension mechanism was introduced that
            allowed an additional 16*2^16 characters to be added to Unicode by
            using two 16-bit values,
            called a surrogate pair,
                to represent each additional character.
        This way,
        Unicode still is a 16-bit character set.

        The new design of Unicode divides the expanded code points
            U+000000 - U+10FFFF into 17 "planes" of 2^16 code points each.
        Plane 00,
            containing the original Unicode code points U+0000 through  U+FFFF,
            is called Basic Multilingual Plane (BMP),
        the others
            supplementary planes.
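
        A one-line sketch of the plane arithmetic in Python
        (plane_of is a hypothetical helper name):

            def plane_of(cp: int) -> int:
                return cp >> 16              # plane index 0x00 .. 0x10

            print(plane_of(0x0041))    # 0 -> BMP
            print(plane_of(0x1D11E))   # 1 -> a supplementary plane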

        Characters in the BMP are special,
        because  they can be represented by
        a single 16-bit value.
        Unicode 3.1 was the 1st Unicode Standard to include characters outside the BMP and
        as of Unicode 3.2,
            44,944 characters out of a total of
            95,156 characters
            are located in the supplementary planes. 


   Universal Character Set

        To be universal means
        to include as many characters for as many languages as possible.
        But what languages are used in a heterogeneous world?
        What kind of scripts do they use?
        How should they be prioritized?
        Are they all equal or
            are some more equal than others?
        The Unicode Consortium was faced with many difficult decisions,
        as language and characters are often matters of national pride and
        the omission of a single character could lead to the
            boycott of the standard in a whole region.

        Based on thorough research on the topic,
        Unicode was divided into blocks of different sizes and
        the scripts of various languages were allocated to specific blocks.
        This way,
            characters from the same script
            would logically be grouped together and
            would still have space to add new characters
                without disturbing their grouping.
        Fig.  13 shows the major blocks of Unicode and
        their code regions.


        Each block is further divided into smaller sub-blocks for specific scripts and
            contains unused regions,  reserved for future use.

        [UNICODE3] gives a detailed description of
            all blocks and
            the characters contained within,
                including a sample glyph for each character.

        The code points
            U+FFFE and
            U+FFFF
                are not considered characters and
                are not and will not be used in future revisions of Unicode.

        To discover the endianness of a system,
            U+FFFE is reserved as
                    the byte-swapped form of
                        U+FEFF (ZERO WIDTH NO-BREAK SPACE),
                    also called the Byte-Order-Mark (BOM).

            U+FFFF can be used by
                applications to signal errors or
                a non-character value.
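
        A minimal sketch of BOM-based detection
        (detect_endianness is a hypothetical helper):

            def detect_endianness(data: bytes) -> str:
                if data[:2] == b"\xff\xfe":
                    return "little-endian"   # U+FEFF stored byte-swapped
                if data[:2] == b"\xfe\xff":
                    return "big-endian"
                return "unknown (no BOM)"

            # Python's "utf-16" codec prepends a BOM in native byte order
            print(detect_endianness("hi".encode("utf-16")))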


   Logical Order
        In Unicode,
            characters are exclusively stored in logical order.
        That is the order
            the characters are read in and
            not necessarily the order they are displayed on screen or printed.
        Some characters like Latin, Greek and Cyrillic characters
            are written left-to-right while others like Arabic and
        Hebrew are written right-to-left.

        Each Unicode character has its writing direction as a property to aid in proper graphical rendering.
        Additionally there are invisible control characters that explicitly mark a direction change in cases of bi-directional text where the direction change might be ambiguous.


   Properties

        Unicode characters have well-defined semantics that
        are specified through
            Character Properties.
        The properties support operations like
            parsing and
            sorting as well as
            other algorithms
                that need to have semantic knowledge about the characters.
        Some properties are normative and
        some are only informative.
            For normative properties,
                applications conforming to the Unicode standard
                must react if they encounter a character having such a property.
            For informative properties,
                it is up to the application whether to honor them or not.
        Below is a small list of the most important properties and their descriptions.
        It does not include all normative properties;
        [UNICODE3] Chapter 4 lists all properties and
        their status as well as full descriptions.

            Alphabetic
                Set if a character is phonetic
                and not ideographic.

            Case
                Some phonetic alphabets have 2 variants of the same character, like "A" and "a".

            Directionality
                Text direction for characters before and after this character,
                as mentioned in the Logical Order section above.

            Numeric Value
                Characters representing numbers
                have a value
                so that they can be used for arithmetic purposes.

            Surrogate
                Whether a character is part of a surrogate pair.

            Decomposition
                If the same character can be represented by
                    the use of 2 other characters,
                    specifies its decomposition rule (see Dynamic Composition and Decomposition below).

            Mirrored
                The character has a different look depending on whether it appears
                    in a left-to-right or right-to-left context, like brackets.
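
        Several of these properties can be inspected with
        Python's unicodedata module,
        which exposes the Unicode Character Database
        (a small sketch; the sample characters are arbitrary):

            import unicodedata as ud

            for ch in "Aä5(":
                print(ch,
                      ud.category(ch),        # e.g. 'Lu' = letter, uppercase
                      ud.bidirectional(ch),   # directionality, e.g. 'L'
                      ud.decomposition(ch),   # e.g. '0061 0308' for ä
                      ud.mirrored(ch))        # 1 for brackets and similar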

   Dynamic Composition and Decomposition
        Unicode not only includes accented characters like Ü, Ç, ǻ,
        it also has a mechanism to dynamically create such composed characters by
        combining a single base character with an arbitrary number of combining characters.
        Not only can accented characters
            that are already included in Unicode
            be emulated this way,
        new characters can also be composed.
        Fig. 15 shows some examples.

            a + ̈ ⇒ ä
            C + ̧ ⇒ Ç
            a + ̊ + ́ ⇒ ǻ
        Fig. 15: Dynamic Composition


        Now,
        with Dynamic Composition,
        there are several ways to encode a single character.
        For example,
            the above mentioned ǻ character could be encoded as
                                    ǻ,
                                    å + ́ or even
                                    a + ̊ + ́.
            This makes searching and
            sorting of text very difficult.

        In order to solve this ambiguity,
            characters that can be dynamically composed
            have a decomposition mapping,
            defining how a character can be decomposed into its basic parts.
        Using the decomposed, canonical form as the in-memory representation,
        searching and sorting become simple again.
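
        A small sketch of canonical decomposition with
        Python's unicodedata module:

            import unicodedata as ud

            composed   = "\u01FB"          # ǻ as a single code point
            decomposed = "a\u030A\u0301"   # a + combining ring + combining acute

            print(composed == decomposed)  # False: different code sequences
            print(ud.normalize("NFD", composed)
                  == ud.normalize("NFD", decomposed))  # True: same canonical form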

   Unification

        All West European languages
        as well as some African and
        South Asian languages
            use the Latin alphabet as common script
            together with their individual extensions,
                usually accented characters.
        The same visual character
        might be pronounced differently in those languages,
            but it is still the same character.
        To reduce the number of characters and redundancy,
        characters with the same appearance have been unified and
        allocated only a single code point.

        Unification is obvious in the case of the Latin alphabet,
        but there are many uncertain unification candidates.
        For example,
            the comma character is mainly used as a thousands-separator in English
            but as a decimal-separator in French,
            yet there is only one Latin COMMA character.
        Unicode does not differentiate on usage
        but only on appearance.
        As another example,
            the unit symbols for minutes and feet
            have been unified with the mathematical prime as the PRIME character (′).
        But there are also exceptions like
            the Greek Omega character and
            the Ohm symbol for electrical resistance.
        These haven't been unified because
        of legacy compatibility and
        their totally different semantics.

        The area in most need of unification
            was the CJK ideographs.
        Sharing common roots,
        many of them have similar or
        even the same visual appearance.
        With unification,
        the more than 130,000 ideographs present in legacy character sets
        have been reduced to less than 30,000.

        But sometimes,
        the characters have evolved differently and
        look slightly different.
        Fig. 13 shows 2 such ideographs,
            U+6D77 ("ocean")
        and U+76F4 ("straight"),
        and their appearances in Traditional Chinese,
            Simplified Chinese,
            Korean and
            Japanese.
        For users of the Latin alphabet,
            the differences seem subtle,
        but for the actual users of the languages the difference is very big.
        The difference is not a glyph problem,
            but one of the actual shape of the character.
        If a Japanese student wrote the Korean variant of an ideograph in an ideograph exam,
            he'd fail it.
        In some cases, like U+6D77,
            Japanese readers would probably be able to
                guess the meaning of the Chinese and Korean variants,
                but in other cases like U+76F4,
        the Chinese variant would be impossible to understand for a Japanese reader and vice versa.

        Still they have been aggressively unified,
        the only exceptions being those cases where
            a legacy character set
                differentiated between variants as separate characters.


        As Unicode does not include a mechanism to specify the language of a text,
        applications have to depend on higher-level protocols to help them
        decide on the correct rendering of the characters,
        such as the "lang" attribute in XML ([XML], 2.12).
        Because of this problem with variants,
            there also can't be a single "Unicode font" that would cover all characters and languages.
        Fig. 17 shows the same Unicode character rendered differently according to the lang attribute of HTML
        (browser and local font dependent).



        Unification has been a source of much controversy
            since the early days of Unicode.
        The first revisions of Unicode were practically useless for East Asian countries
            because of overzealous unification to
            fit as much as possible into the 16 bits.
        Newer versions of Unicode provide room for more characters so that
            variants as well as ideographs that had previously been missing
                are being continuously added to Unicode's repertoire.


   Surrogate Pairs
        Though the number of characters had been greatly reduced through
        unification work,
        Unicode's designers soon had to realize that
            the number of characters that could be included in Unicode was simply not enough.
        Even the vast number of 65,536 characters was insufficient.
        But Unicode being a 16-bit fixed-length code,
            how could characters with a number higher than
            U+FFFF be represented
            without fundamental changes?


        For this purpose, 2,048 code points have been reserved as "surrogates" beginning with Unicode 2.0.
        1,024 are designated "high surrogates",
        1,024 are "low surrogates".
        These are not characters themselves,
            but by
            combining one high and
            one low surrogate,
            called a "surrogate pair",
        they together represent a single Unicode character.
        A simple algorithm is used to
        calculate the actual Unicode character number
        from the surrogate pair; it is defined in [UNICODE3], Sec. 3.7.
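
        A sketch of that algorithm in Python
        (the function names are made up for illustration):

            def to_surrogate_pair(cp: int) -> tuple:
                assert 0x10000 <= cp <= 0x10FFFF
                v = cp - 0x10000                         # 20 bits
                return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

            def from_surrogate_pair(high: int, low: int) -> int:
                return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

            high, low = to_surrogate_pair(0x1D11E)       # MUSICAL SYMBOL G CLEF
            print(hex(high), hex(low))                   # 0xd834 0xdd1e
            print(hex(from_surrogate_pair(high, low)))   # 0x1d11e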


        This method is clearly a violation of Unicode's basic philosophy of simplicity,
        because the surrogates must be handled specially.
        But still,
        the scheme is very well thought out to minimize the negative effects, based on
        experiences with legacy character encodings.
        Depending on application support,
            a surrogate pair is shown as:
                if the application knows nothing about surrogates:
                    2 unknown characters,
                else
                    1 character.
        Because a high surrogate is always followed by a low surrogate and
        the encoded character is not dependent on any other values before or after the pair,
            character boundaries are obvious in a sequence of pairs.

        Should a character stream be interrupted,
            the maximum damage is limited to a single character. 

        With the introduction of surrogate pairs,
        the potential number of characters that can be included in Unicode increased 17-fold.


Unicode and ISO/IEC 10646


        ISO-10646,
            "Information technology -- Universal Multiple-Octet Coded Character Set (UCS)",
                has the same basic goals as Unicode:
                to create a Universal Character Set.
            An "octet" is ISO's term for an 8-bit byte.

        The two standards have agreed to share the same character repertoire and
        character numbering so that
            both standards are character-by-character equal in their character sets.

        Usually Unicode revisions are published more frequently,
            due to the administrative overhead of the ISO standards process,
        but both organizations have agreed to synchronize as often as possible.

        The advantage of the collaboration for Unicode:
            many national standards
                don't allow industry standards like Unicode to be referenced
                but do allow ISO standards.

        For ISO on the other hand,
            Unicode has the computing industry's support and
            compatibility with it guarantees industry acceptance and feedback.

        In contrast to Unicode,
        ISO-10646 doesn't limit itself to 16 bits.
        ISO-10646 is a 4-octet (32-bit) character set capable of including more than 2*10^9 characters (the highest bit is not used).

        It is organized into
            128 groups, each containing
            256 planes that again include
            256 rows with 256 cells each.

        Plane 0x00 of Group 0x00 is called the Basic Multilingual Plane (BMP)
        and has exactly the same size and code points
            as Unicode's BMP.

        Planes 0x01 to 0x10
            contain the characters from Unicode's additional 16 planes
            that are represented using surrogates in Unicode.
        Though ISO-10646 could contain far more characters than Unicode,
        the additional groups and planes
        are currently reserved for future use and
        no characters can be defined there,
        to maintain compatibility with Unicode as long as possible.

        ISO-10646's canonical representations
        are UCS-4 and UCS-2.
            In UCS-4,
                a character's number is encoded with one octet each for its
                    group,
                    plane,
                    row and
                    cell number.

            In UCS-2,
                if a text only contains characters from the BMP,
                        the group and plane octets can be omitted
                        (2-octet representation).

                If a Unicode text is interpreted as UCS-2,
                    all characters above U+FFFF will be lost,
                    because ISO-10646 doesn't have the surrogate mechanism.
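
        A sketch of the UCS-4 octet layout in Python
        (ucs4_octets is a hypothetical helper):

            def ucs4_octets(cp: int) -> bytes:
                group = (cp >> 24) & 0x7F    # highest bit unused
                plane = (cp >> 16) & 0xFF
                row   = (cp >> 8)  & 0xFF
                cell  =  cp        & 0xFF
                return bytes([group, plane, row, cell])

            print(ucs4_octets(0x0041).hex())    # 00000041 -> BMP
            print(ucs4_octets(0x1D11E).hex())   # 0001d11e -> plane 0x01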

Going deeper
    https://www.w3.org/International/talks/0505-unicode-intro/

